Final Project: A Simple, Memory-Efficient, Bayesian Part-of-Speech Tagger with Constant-Time Update

Authors

  • David Stein
  • Brian Cass
Abstract

In this project we explore a Bayesian part-of-speech (POS) tagging technique with a focus on a low memory profile and low computational demands. We achieve this by representing our belief about a word's part of speech as a probability density function (PDF) and a confidence value instead of a single tag. By computing trigrams and bigrams over combinations of parts of speech instead of combinations of words, we reduce the memory requirement to what is needed to store the n-gram priors, and we demonstrate a solution that is linear in the size of the vocabulary. By using confidence metrics, we can achieve arbitrary accuracy by choosing to skip difficult words; in those cases we instead reduce the search space for more sophisticated taggers by restricting the options to a subset of the possible tags.
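The idea of keeping a belief over tags plus a confidence value can be illustrated with a small sketch. This is not the authors' implementation: the toy tag set, the add-alpha smoothing, the helper names (`train`, `likelihood`, `tag_sentence`), the use of bigram priors only, and the rule that returns the smallest set of tags covering a probability mass of 0.7 are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

TAGS = ("DET", "NOUN", "VERB")  # toy tag set, assumed for illustration
START = "<s>"                   # hypothetical sentence-start marker


def train(tagged_sentences, alpha=1.0):
    """Collect tag-bigram priors and per-word tag counts (add-alpha smoothing)."""
    trans = {prev: Counter() for prev in (START,) + TAGS}
    word_tag = defaultdict(Counter)
    tag_total = Counter()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            word_tag[word.lower()][tag] += 1
            tag_total[tag] += 1
            prev = tag
    prior = {}
    for prev, counts in trans.items():
        denom = sum(counts.values()) + alpha * len(TAGS)
        prior[prev] = {t: (counts[t] + alpha) / denom for t in TAGS}
    return {"prior": prior, "word_tag": dict(word_tag),
            "tag_total": tag_total, "vocab": len(word_tag), "alpha": alpha}


def likelihood(model, word, tag):
    """Smoothed P(word | tag); unseen words get the same value for every tag."""
    count = model["word_tag"].get(word.lower(), {}).get(tag, 0)
    denom = model["tag_total"][tag] + model["alpha"] * (model["vocab"] + 1)
    return (count + model["alpha"]) / denom


def tag_sentence(model, words, mass=0.7):
    """Return (word, belief, confidence, candidates) for each word.

    The belief is a distribution over TAGS. Each update costs
    O(len(TAGS) ** 2) operations, independent of the vocabulary size.
    """
    belief = {START: 1.0}
    results = []
    for word in words:
        posterior = {}
        for t in TAGS:
            # Prior for tag t, predicted from the belief about the previous tag.
            predicted = sum(p * model["prior"][prev][t]
                            for prev, p in belief.items())
            posterior[t] = predicted * likelihood(model, word, t)
        norm = sum(posterior.values())
        belief = {t: v / norm for t, v in posterior.items()}
        ranked = sorted(TAGS, key=belief.get, reverse=True)
        # Smallest set of top tags whose total probability reaches `mass`.
        candidates, covered = [], 0.0
        for t in ranked:
            candidates.append(t)
            covered += belief[t]
            if covered >= mass:
                break
        results.append((word, belief, belief[ranked[0]], candidates))
    return results


corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
]
model = train(corpus)
for word, belief, confidence, candidates in tag_sentence(model, ["the", "zebra"]):
    print(word, round(confidence, 3), candidates)
```

In this sketch a confidently tagged word yields a single candidate, while a word the model is unsure about (here the unseen word "zebra") yields a short list of candidate tags that a more sophisticated tagger could choose from.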


Similar resources

BUILDING AN EFFICIENT, SCALABLE, AND TRAINABLE PROBABILITY-AND-RULE-BASED PART-OF-SPEECH TAGGER OF HIGH ACCURACY

This project aims to build an efficient, scalable, portable, and trainable part-of-speech tagger. Using 98% of Penn Treebank-3 as the training data, it builds a raw tagger using Bayes' theorem, a hidden Markov model, and the Viterbi algorithm. After that, a reinforcement machine learning algorithm and contextual transformation rules were applied to increase the tagger's accuracy. The tagge...


Part-of-Speech Tagging of Dutch with MBT, a Memory-Based Tagger Generator

We present a part of speech tagger (morphosyntactic disambiguator) for Dutch, constructed by means of the Memory-Based Tagger generation method. In this approach, inductive learning methods are used to derive a tagger, lexicon and unknown word category guesser fully automatically from a tagged example corpus. Advantages of the approach are (i) fast tagger development time without linguistic eng...


MBT: A Memory-Based Part of Speech Tagger-Generator

We introduce a memory-based approach to part-of-speech tagging. Memory-based learning is a form of supervised learning based on similarity-based reasoning. The part-of-speech tag of a word in a particular context is extrapolated from the most similar cases held in memory. Supervised learning approaches are useful when a tagged corpus is available as an example of the desired output of the tagger. B...


An efficient memory-based morphosyntactic tagger and parser for Dutch

We describe TADPOLE, a modular memory-based morphosyntactic tagger and dependency parser for Dutch. Though primarily aimed at being accurate, the design of the system is also driven by optimizing speed and memory usage, using a trie-based approximation of k-nearest neighbor classification as the basis of each module. We perform an evaluation of its three main modules: a part-of-speech tagger, a...


A Part-of-Speech Tagging System for the Persian Language

Abstract: Part-of-speech (POS) tagging is an essential task for many models and methods in other areas of natural language processing, such as machine translation, spell checking, text-to-speech, and automatic speech recognition. So far, highly accurate POS taggers have been created for many languages. In this paper, we focus on POS tagging in the Persian language. Because of problems in Persian POS t...




Publication date: 2011